[ci][deps][1/3] PY313 DEP UNIFICATION: compiling requirements_compiled_py3.13.txt and depsets #62864
elliot-barn merged 23 commits into master from compile-req-compiled-py313
Conversation
Signed-off-by: elliot-barn <[email protected]>
Code Review
This pull request adds Python 3.13 support by introducing a dependency compilation step in the CI pipeline and updating various requirement files and lock files. The review identifies several critical issues, including the use of non-existent package versions for scipy, networkx, keras, onnxruntime, and protobuf. Additionally, the pip-compile command in the CI script is missing the necessary flag to target Python 3.13, and a regex pattern used for stripping version suffixes contains a syntax error that needs correction.
Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
test_python_object_leak enables gc.set_debug(gc.DEBUG_SAVEALL) and asserts len(gc.garbage) == 0 after 100 exception-raising actor calls. DEBUG_SAVEALL captures every unreachable cycle the collector finds, including one-shot cycles from dependencies' import-time class definitions — making the assertion prone to false positives any time a dep adopts an ABCMeta-based class.
Add gc.freeze() after the existing gc.collect() in both actor __init__ methods. gc.freeze() moves everything currently tracked by the collector into a permanent generation that gc.collect() never re-scans. The test now measures only cycles created by the workload itself — what it was always meant to catch. The strict == 0 assertion is preserved.
Exposed on the py3.13 compile-refresh branch after pyarrow 19.0.1 -> 23.0.1. pyarrow 21 (apache/arrow#45818) added collections.abc.Sequence / Mapping bases to ListScalar / StructScalar / MapScalar, putting ABCMeta in the metaclass chain. Cython's class construction leaves a 6-object cycle (class + MRO tuple + dict + __abstractmethods__ frozenset + _abc_data + getset_descriptor) that every Ray worker inherits at startup. gc.freeze() handles this and any future ABCMeta-adopting dep.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
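A minimal sketch of the pattern described in this commit, assuming an illustrative actor and driver loop (LeakChecker, f, and garbage_count are stand-ins, not the real test_python_object_leak code):

```python
import gc

import ray


@ray.remote
class LeakChecker:
    def __init__(self):
        gc.set_debug(gc.DEBUG_SAVEALL)
        # Collect once, then freeze: everything already tracked (including
        # import-time ABCMeta cycles from dependencies) moves into the
        # permanent generation, which later gc.collect() calls never re-scan.
        gc.collect()
        gc.freeze()

    def f(self):
        raise ValueError("boom")

    def garbage_count(self):
        # Only cycles created by the workload after __init__ remain visible.
        gc.collect()
        return len(gc.garbage)


ray.init()
actor = LeakChecker.remote()
for _ in range(100):
    try:
        ray.get(actor.f.remote())
    except ValueError:
        pass
# With freeze in place, dependency import-time cycles no longer trip this.
assert ray.get(actor.garbage_count.remote()) == 0
```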
The test-only redis server cert at python/ray/tests/tls/redis.crt had no
Subject Alternative Name extension — only Subject: CN=Server. Modern TLS
clients with hostname verification enabled reject the cert when connecting
to 127.0.0.1 / localhost, causing test_redis_tls to time out with
redis logging:
Error accepting a client connection: error:14094412:SSL routines:
ssl3_read_bytes:sslv3 alert bad certificate
Re-sign redis.crt under the existing CA with
subjectAltName=DNS:localhost,IP:127.0.0.1,IP:::1. Private keys
(ca.key and redis.key) are unchanged — only the cert is re-issued. Update
the README with the corrected openssl recipe and a note on why SAN is
required.
Exposed on the py3.13 compile-refresh branch after redis-py bumped from
4.5.4 to 7.4.0. In redis-py >= 5.0 the SSLConnection default for
ssl_check_hostname flipped from False to True (see
redis/[email protected]/redis/connection.py:1919). Ray's get_redis_cli()
creates redis.Redis(..., ssl_cert_reqs="required") without explicitly
passing ssl_check_hostname, so it picks up the library default.
Fixing the cert conforms to standard TLS expectations rather than opting
out of hostname verification. Browsers deprecated CN-only certs in 2017;
production PKI has shipped SAN-bearing certs by default for years.
No customer-facing TLS behavior changes — this only adjusts a CI fixture.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
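A hedged Python sketch of the client-side behavior the commit above describes (host, port, and certificate path are illustrative; Ray's actual get_redis_cli() wiring differs):

```python
import redis

# redis-py < 5.0 defaulted ssl_check_hostname to False; >= 5.0 defaults it to
# True, so the client verifies that the server cert's SAN covers the host it
# dialed. A CN-only cert fails that check for 127.0.0.1/localhost; the
# re-signed cert with subjectAltName=DNS:localhost,IP:127.0.0.1,IP:::1 passes.
client = redis.Redis(
    host="127.0.0.1",
    port=6379,
    ssl=True,
    ssl_ca_certs="python/ray/tests/tls/ca.crt",  # illustrative path
    ssl_cert_reqs="required",
    # ssl_check_hostname is intentionally not passed, so the library default
    # applies: False on redis-py 4.x, True on redis-py >= 5.0.
)
client.ping()
```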
Base_build installs ray via test-requirements.txt's `ray>=2.47.1` floor, which leaks into the corebuild layer. When PyPI shipped ray 2.55.1 (2026-04-22 20:09 UTC) the next base_build rebuild picked it up and broke the `:ray: core: dashboard tests [g6_s15]` postmerge job starting at build #17201 (last green #17195). Even though ci/ray_ci/tests.env.Dockerfile force-reinstalls editable ray 3.0.0.dev0, the stale 2.55.1 tree left enough behind to fail ray._common imports and the backwards-compat conda script.
Drop the inherited wheel at the end of core.build so tests.env lands editable ray on a clean slate, independent of whatever version the base_build image happens to carry.
Signed-off-by: elliot-barn <[email protected]>
Commit 2c26cdb added gc.collect() + gc.freeze() to both actor __init__ methods to exempt pyarrow's ABCMeta-based ListScalar/StructScalar class cycle (introduced in pyarrow 21) from DEBUG_SAVEALL. That fix held on py3.13, but the py3.10 core-build depset still fails with "assert 6 == 0": the 6-object ABCMeta cycle (class + MRO tuple + dict + __abstractmethods__ frozenset + _abc_data + getset_descriptor) shows up from the very first f()/gen() call.
Root cause: on py3.10 the Ray worker doesn't eagerly import pyarrow at startup. __init__ freezes before pyarrow's classes exist, so the cycle they later create lands in a non-frozen generation and DEBUG_SAVEALL catches it. Explicitly `import pyarrow` at the top of both __init__ methods so the ABCMeta cycle is present at freeze time.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
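A minimal sketch of the ordering fix, with an illustrative actor name (it mirrors the freeze added in the earlier commit, not the actual test source):

```python
import gc

import ray


@ray.remote
class Worker:  # illustrative stand-in for the two actors in the leak test
    def __init__(self):
        # Import pyarrow eagerly so its ABCMeta class cycle already exists and
        # is tracked by the collector before the freeze. On py3.10 the worker
        # doesn't import pyarrow at startup, so without this import the cycle
        # is created after the freeze, lands in a non-frozen generation, and
        # DEBUG_SAVEALL reports it as 6 garbage objects.
        import pyarrow  # noqa: F401

        gc.collect()
        gc.freeze()
```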
requirements_compiled_py3.13.txt resolved `redis` to 7.4.0 on this branch; master's py3.13 compiled file and the legacy requirements_compiled.txt both still pin 4.5.4. The 4.x → 7.x jump crosses the redis-py 5.x and 6.x majors (multiple async/pubsub API changes).
The core-redis-1 shard fails deterministically on test_worker_graceful_shutdown.py::test_ray_get_during_graceful_shutdown[asyncio] — the asyncio actor receives SIGTERM but the in-flight ray.get doesn't complete, and the worker exits with SYSTEM_ERROR / SIGTERM code 1. redis-py 7.x is the most plausible culprit on the shutdown path.
Pin redis in the py313 source file per the source-only pinning workflow, so the next recompile can't drift above 4.5.4.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
requirements_compiled_py3.13.txt resolved `datasets` to 4.8.4 on this
branch; master's py3.13 compiled file resolves 3.6.0. HF datasets 4.0
(July 2025) was a major break — removed script-based dataset support
and changed schema/dtype behavior that cascades into
ray.data.from_huggingface → iter_torch_batches → accelerate's
gather_for_metrics.
traingputests / traingpu2tests shards fail deterministically on
//python/ray/train:accelerate_torch_trainer and
:accelerate_torch_trainer_no_raydata with:
File ".../evaluate_modules/.../glue.py", line 84, in simple_accuracy
return float((preds == labels).mean())
AttributeError: 'bool' object has no attribute 'mean'
This surfaces when `metric.add_batch` is never called (or preds and
refs end up as scalars), so `[] == []` in the glue accuracy fn returns
a Python bool instead of an ndarray. Pinning datasets to 3.6.0 matches
the last-known-good CI baseline on master.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
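A minimal illustration of the failure mode above, assuming nothing about the evaluate glue module beyond the quoted line:

```python
import numpy as np

# What the glue accuracy helper expects: ndarray == ndarray broadcasts to a
# boolean array, and .mean() gives the accuracy.
preds = np.array([1, 0, 1])
labels = np.array([1, 1, 1])
print(float((preds == labels).mean()))  # 0.666...

# What happens when add_batch never runs and both sides are plain empty
# Python lists: == returns a single bool, which has no .mean().
preds, labels = [], []
try:
    float((preds == labels).mean())
except AttributeError as e:
    print(e)  # 'bool' object has no attribute 'mean'
```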
PR #62643 ([Core] (Resource Isolation 12/n) Switch group killing policy to by time killing policy) flipped the default OOM killing policy from the legacy owner-group, single-worker policy to the new time-based, multi-worker policy. Two tests in test_memory_pressure.py assert on the legacy semantics:
- test_memory_pressure_kill_newest_worker asserts exactly one named actor survives after the OOM kill
- test_newer_task_not_retriable_kill_older_retriable_task_first asserts the older retriable task is the one chosen to be killed
Under the new policy these assertions aren't guaranteed, and the mempress CI job on compile-req-compiled-py313 fails all 3 bazel attempts (the raylet log itself prints the "RAY_worker_killing_policy_by_group" escape hatch). Set `worker_killing_policy_by_group=True` in the `_system_config` of both `ray_with_memory_monitor` fixtures to pin the legacy policy these tests were written against.
Not specific to the py3.13 dep refresh — master inherits the same regression from PR #62643; fixing here to unblock the refresh branch's CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
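A hedged sketch of the fixture pin, assuming only the quoted _system_config key (the real ray_with_memory_monitor fixtures also configure memory-monitor thresholds and other settings):

```python
import pytest

import ray


@pytest.fixture
def ray_with_memory_monitor():
    # Pin the legacy owner-group, single-worker OOM killing policy that the
    # two assertions above were written against.
    ray.init(
        num_cpus=1,  # illustrative; the real fixture sets more than this
        _system_config={"worker_killing_policy_by_group": True},
    )
    yield
    ray.shutdown()
```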
Signed-off-by: elliot-barn <[email protected]>
aslonnie
left a comment
this PR is way too big.. please either break it down to smaller pieces, or help me on how to reason about the changes here..
`test_backwards_compatibility.sh:39` was `conda create -y -n <env> python=<version>` without listing `pip`. conda-forge's `python` package stopped bundling `pip` as a dep, so after `conda activate <env>` the shell has python from the new env but `pip` falls through PATH to the base miniforge env's pip. That base pip sees the editable ray==3.0.0.dev0 in base site-packages, uninstalls it, and installs ray==2.0.1 into base. Meanwhile `python -c "import ray"` runs against the new conda env (which has neither pip nor ray) and fails with ModuleNotFoundError.
The cascade breaks every subsequent dashboard test in the same shard that does `from ray._common.test_utils import ...`: ray 2.0.1 now sits in base and predates the `ray._common` module, so the import fails with ModuleNotFoundError: No module named 'ray._common' across test_ray_actor_events and friends.
Commit 93802b9 ([ci] uninstall ray in corebuild after depset install) addressed a related but distinct symptom (stale PyPI ray==2.55.1 leaking into the corebuild layer) and wasn't enough on its own — the conda subprocess clobbers the editable install at test runtime regardless of which ray the image ships with.
Adding `pip` to the `conda create` arg list keeps pip inside the new env, so `pip install` operates on the new env instead of base. The editable ray in base is preserved for the rest of the shard.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
if [[ "$HAS_TORCH" == "0" ]]; then
  pip uninstall -y torch
fi
)
Duplicated compile function risks future divergence
Low Severity
compile_313_pip_dependencies is a near-verbatim copy of compile_pip_dependencies, differing only in the default TARGET and the list of input requirement files. The duplicated scaffolding (pip-tools install, torch detection, sed post-processing, torch cleanup) means a fix or change to one function can easily be missed in the other. Extracting the shared logic into a helper that accepts the target and file list as parameters would eliminate this risk.
Reviewed by Cursor Bugbot for commit fc82923.
Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
…ile-req-compiled-py313
… fixtures"
This reverts commit 943f417. The test_memory_pressure OOM-policy fix lives on its own branch (elliot-barn-fix-test-memory-pressure-oom-policy) and is not required for the py3.13 dependency refresh.
Signed-off-by: elliot-barn <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
8f6eb8e to 0584e88
This reverts commit f209a0e. Signed-off-by: elliot-barn <[email protected]>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit 5dec002.
Signed-off-by: elliot-barn <[email protected]>
Signed-off-by: elliot-barn <[email protected]>
* [ci] Migrate LLM auto-select and multi-node compute configs to new schema (#62873) Signed-off-by: sai.miduthuri <[email protected]> Co-authored-by: Claude Opus 4.7 <[email protected]> * [serve] Deflake test_haproxy_metrics against HAProxy soft-reload (#62930) test_haproxy_metrics asserts `haproxy_backend_http_responses_total{proxy="http-default",code="2xx"} 1` after one request. The counter is racy: - HAProxy backend health checks can increment it above 1, and - a HAProxyManager soft-reload (which fires on every backend config change) can zero it in the new worker. Also, CI failures are unreadable today because pytest truncates the metrics body in `assert x in y` to "...Har...". Fix: poll with wait_for_condition, send a request each iteration, accept counter >= 1. Also dump full /metrics on timeout so the next failure is debuggable. Passes 5/5 locally --------- Signed-off-by: Seiji Eicher <[email protected]> * [data][1/n] DataSourceV2: refactor V2 listing/scanner/reader infrastructure (#62975) Internal refactor of V2 listing/scanner/reader infrastructure to prep for the upcoming ListFiles/ReadFiles op split. No public API change. - Listing: partition-column helpers on FileManifest, sample_files + _build_pruners helpers in listing_utils. - FileReader.read(manifest): cached_property file_dataset_schema, _broadcast_partition_value helper, derived_items synthesis loop, early-return on empty manifest. Caller-supplied schema overrides pyarrow's per-fragment inference for the all-null first-file case. - FileScanner: drop bucketing helper plan() (moved upstream to plan_list_files_op in PR-A2), add prune_manifest hook, keep compute_local_scheduling (used by V1 dispatch until PR-D). - ArrowFileScanner / ParquetFileReader / Scanner: simplifications aligned with the new manifest-driven read path. - arrow_block.py + dataset.py: Schema.names hides _bsp_stub stub column produced when the scanner emits zero-column batches. This is breaking up PR: #62880 Co-authored-by: Goutam V. <> * [Docs] Replace deprecated busyboxplus curl image in Kubernetes examples (fixes #61538) (#63019) ## Summary Fixes broken Kubernetes example in RayService quickstart docs. The image `radial/busyboxplus:curl` is no longer usable due to deprecated Docker manifest format, causing ImagePullBackOff errors. ## Changes - Replaced `radial/busyboxplus:curl` with `curlimages/curl:latest` ## Testing - Verified the new image works with `kubectl run` - Confirmed curl commands execute successfully inside the pod ## Issue Closes #61538 --------- Signed-off-by: Chaitanya Bharadwaj <[email protected]> Signed-off-by: Chaitanya Bharadwaj <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [serve] Evict per-deployment LongPollHost state on deployment delete (#62820) ## Problem `LongPollHost` had no eviction API. The deletion path in `DeploymentStateManager.update()` cleaned scheduler, autoscaler, and `_deployment_states` but never told the long-poll host. Three per-deployment keys — `(DEPLOYMENT_TARGETS, id)`, the Java-compat `(DEPLOYMENT_TARGETS, name)`, and `(DEPLOYMENT_CONFIG, id)` — survived for the life of the controller, bounded by unique `(name, app_name)` pairs. It also meant **handle routers** (the routers embedded in `serve.get_deployment_handle(...)`, in replicas or user driver code) never received `is_available=False` on delete. 
`is_available` is derived from `not _terminally_failed()` at `deployment_state.py:3198-3216`, not from "deleting"; healthy deletes emit `is_available=True`, and `broadcast_running_replicas_if_changed` can even early-return and emit nothing at all. Requests through the handle then queue or hang instead of failing fast with `DeploymentUnavailableError`. HTTP/gRPC proxies are unaffected — they subscribe to `ROUTE_TABLE`, which `EndpointState.delete_endpoint()` handles correctly. ## Fix - **`LongPollHost.remove_keys(keys)`** — pops the four per-key maps, decrements the pending-clients gauge by the number of woken waiters, fires each waiter's event. - **`listen_for_change` hardening** — done branch skips evicted keys (was `KeyError`); `not_done` cleanup uses `.get()` instead of indexing to avoid resurrecting `defaultdict` entries; empty sets are popped. - **Delete path** — tombstones `DEPLOYMENT_TARGETS` via `notify_changed` and evicts only `DEPLOYMENT_CONFIG`. The tombstoned key is intentionally *not* evicted in the same sync tick: parked waiters run only after `update()` returns, by which point the done-branch guard would drop the tombstone. - **Batched gauge writes** (per Gemini review) — collect affected namespace tags, flush one `pending_clients_gauge.set(...)` per unique tag after each loop. After this, handle routers flip to `is_available=False` within ms of delete and raise `DeploymentUnavailableError` immediately, rather than relying on side channels (handle lifetime, driver GC, caller timeouts) to eventually notice. --------- Signed-off-by: harshit <[email protected]> Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]> * [Core] remove pydantic v1 support (#62716) ## Description Drop Pydantic v1 support in Ray and require Pydantic v2 for Ray extras that depend on it. Removing Pydantic v1 support instead of keeping an additional compatibility fix for Python 3.14. This makes the dependency behavior clearer and lets us delete v1-specific compatibility code. ## Related issues https://github.com/ray-project/ray/issues/62664 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: zhilong <[email protected]> * [Data] Reduce verbosity of arrow conversion warning logs (#61486) ## Description When Arrow conversion fails and Ray Data falls back to pickle serialization, the warning log includes the full exception traceback (`exc_info=ace`), which can be extremely noisy — especially for nested datatypes like image arrays where the data representation alone spans many lines. This PR moves the detailed error message and traceback to `DEBUG` level, keeping the `WARNING` concise and actionable: **Before:** ``` WARNING arrow.py:290 -- Failed to convert column 'flat_images' into pyarrow array due to: Error converting data to Arrow: [[array([[[130, 118, 255], [132, 117, 255], ...]]]...; falling back to serialize as pickled python objects Traceback (most recent call last): File ".../arrow.py", line 258, in _convert_to_pyarrow_native_array ... (10+ lines of traceback) ``` **After:** ``` WARNING arrow.py:290 -- Failed to convert column 'flat_images' into pyarrow array; falling back to serialize as pickled python objects. To see the full error, set logging level to DEBUG. 
``` ## Related issues Fixes #57840 ## Additional information - The full error details + traceback are still available at `DEBUG` level for anyone who needs to investigate - All existing unit tests pass (`test_transform_pyarrow.py`, `test_arrow_type_conversion.py`) - The `ArrowConversionError` already truncates data to 200 chars, but even that plus the traceback was excessively verbose for a warning --------- Signed-off-by: slxswaa1993 <[email protected]> Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: Richard Liaw <[email protected]> * [serve] Increase controller benchmark frequency (#63029) ## Description We need denser benchmark results to identify regressions. ## Related issues > Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. Signed-off-by: Jeffrey Wang <[email protected]> * [ci][deps][1/3] PY313 DEP UNIFICATION: compiling requirements_compiled_py3.13.txt and depsets (#62864) Refreshes `requirements_compiled_py3.13.txt` and the full set of raydepsets locks against current source pins, and adds the supporting CI plumbing and source-file changes needed to make the py3.13 lock resolvable as a constraint across all py3.10/3.11/3.12/3.13 depsets. ## CI infrastructure - **`.buildkite/dependencies.rayci.yml`** — new `pip_compile_313_dependencies` Buildkite step (mirror of the existing 3.11 compile job). Runs `compile_313_pip_dependencies`, uploads the artifact, and fails the build if `requirements_compiled_py3.13.txt` drifts from source. - **`ci/ci.sh`** — new `compile_313_pip_dependencies()` function that points pip-compile at the `python/requirements/py313/` and `python/requirements/ml/py313/` overrides and emits `requirements_compiled_py3.13.txt`. ## Source-file pins These drive the lock changes — no manual edits to the generated lock files. ### `python/requirements/py313/test-requirements.txt` - `fastapi==0.121.0` — FastAPI 0.125+ removed `pydantic.v1` route support; `test_pydantic_serialization` still uses v1 BaseModel. - `asgiref==3.9.2` — 3.10+ regresses Serve direct-ingress timeout / disconnect handling. - `redis==4.5.4` — TLS test compatibility. - `opentelemetry-proto==1.39.0` and `opentelemetry-exporter-otlp-proto-grpc==1.39.0` — co-pinned with `opentelemetry-sdk` so vllm (rayllm depset) can satisfy the in-family pins. - `grpcio==1.76.0` + matching `grpcio-tools` / `grpcio-status` — bisecting `test_raylet_and_agent_share_fate` against grpcio 1.80 startup cost on the runtime-env agent. - `jsonschema>=4.23.0,<4.25.0` — 4.25 introduced `rfc3987-syntax` which pins `lark==1.3.1`, conflicting with vllm's `lark==1.2.2`. - Dual `python_version`-marker pins for `protobuf`, `scipy`, `contourpy`, `networkx` — these packages dropped py3.10 wheels at the same time the py3.13 lock needed newer floors. Dual pinning preserves the cross-py-version compat path when the py3.13 lock is consumed as a constraint by py3.10 depsets. ### `python/requirements/ml/py313/` - `data-requirements.txt` — `lance-namespace==0.6.1`. - `dl-cpu-requirements.txt` / `dl-gpu-requirements.txt` — `nvidia-nccl-cu12` aligned across CPU/GPU so the CPU-built lock doesn't pin a version that conflicts with cu128 torch in GPU depsets. - `ml-requirements.txt` — dual `keras` pin (3.12.1 for py<3.11, 3.14.0 for py>=3.11); keras 3.13 dropped py3.10. - `rllib-requirements.txt` — dual `onnxruntime` pin (1.20.0 / 1.24.4) keyed on python version. 
- `train-requirements.txt` — `datasets==3.6.0`. ### `python/requirements/data/` - `pyarrow-latest.txt` — added `delta-sharing`. - `pyarrow-v9.txt` — pinned `datasets==2.14.4`, added `delta-sharing`. ## Depsets config **`ci/raydepsets/configs/ci_data.depsets.yaml`** — added relax entries so v9 / tfxbsl resolves can downgrade chains together: - `relaxed_data`: relaxed `delta-sharing`, `dill`, `multiprocess` (datasets 2.14.4 caps `dill<0.3.8` but py313 lock has `dill==0.4.1`). - `relaxed_data_tfxbsl`: relaxed `absl-py`, `grpcio-status`, `contourpy`, `scipy`, `delta-sharing` (tfx-bsl 1.16.x caps `absl-py<2.0.0` and `protobuf<6`; contourpy 1.3.3 + apache-beam 2.53.0 numpy clash). ## Lock files Regenerated `requirements_compiled_py3.13.txt` and ~70 depset locks under `python/deplocks/` (base / ci / llm / ray_img / docs). --------- Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]> * [ci] Fix mismatch between bisect instance-type and runner-queue name (#62742) ## Description A mismatch in the `instance_type` and `runner_queues` fields of bisect pipeline rayci configs causes all `bisect` pipeline builds to fail. ## Related issues None ## Additional information https://buildkite.com/ray-project/bisect/builds/3673/steps/canvas?sid=019d9d9d-05de-4326-b5dc-d818fbcdc71f&tab=output Signed-off-by: sai.miduthuri <[email protected]> * [ci] Migrate dataset GPU core compute configs to new schema (#62832) ## Summary Migrates 2 Anyscale compute config files from the legacy schema to the new SDK 2026 schema, and adds `anyscale_sdk_2026: true` to all corresponding test entries in `release_data_tests.yaml`. ### Compute configs migrated (2 files) **Dataset tests** (`release/nightly_tests/dataset/`): - `fixed_size_gpu_compute.yaml` - `autoscaling_gpu_compute.yaml` ### Tests updated in release_data_tests.yaml (3 tests) Via `{{scaling}}_gpu_compute.yaml` template: 1. `image_classification_{{scaling}}` 2. `image_classification_from_parquet_{{scaling}}` Hardcoded `dataset/autoscaling_gpu_compute.yaml` (chaos test overrides `working_dir: nightly_tests`): 3. 
`image_classification_chaos` ### Schema changes applied - `cloud_id` → `cloud`, `ANYSCALE_CLOUD_ID` → `ANYSCALE_CLOUD_NAME` - Removed `region: us-west-2` - `head_node_type` → `head_node`, `worker_node_types` → `worker_nodes` - `min_workers` → `min_nodes`, `max_workers` → `max_nodes` - `use_spot: false` → `market_type: ON_DEMAND` - `advanced_configurations_json` → `advanced_instance_config` - Dropped head/worker `name:` fields (single worker group per config) - Dropped head-node `resources: {cpu: 0}` — new SDK defaults head CPU to 0 when `worker_nodes` is present (head is CPU-only coordinator; GPU workloads run on `g4dn.2xlarge` workers) ## Test plan - [x] Both config files validated against `ComputeConfig.from_yaml()` - [x] CI passes with `anyscale_sdk_2026: true` flag on all 3 test entries: https://buildkite.com/ray-project/release/builds/89918 🤖 Generated with [Claude Code](https://claude.com/claude-code) Signed-off-by: sai.miduthuri <[email protected]> Co-authored-by: Claude Opus 4.7 <[email protected]> * [ci] Migrate scheduling and single node benchmark compute configs to new schema (#62489) ## Summary - Migrated 4 compute config files to the new Anyscale SDK schema: `scheduling.yaml`, `scheduling_gce.yaml`, `single_node.yaml`, `single_node_gce.yaml` - Updated 2 test entries (`single_node`, `scheduling_test_many_0s_tasks_many_nodes`) in `release_tests.yaml` with `anyscale_sdk_2026: true` flag - Key transformations: `cloud_id` -> `cloud`, `head_node_type` -> `head_node`, `worker_node_types` -> `worker_nodes`, flattened `custom_resources`, renamed `advanced_configurations_json`/`gcp_advanced_configurations_json` -> `advanced_instance_config`, `use_spot: false` -> `market_type: ON_DEMAND`, `min/max_workers` -> `min/max_nodes` ## Test plan - [x] All 4 configs validated against `ComputeConfig.from_yaml()` - [x] Verify `single_node` nightly tests pass on Buildkite - [x] Verify `scheduling_test_many_0s_tasks_many_nodes` nightly tests pass on Buildkite 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: sai.miduthuri <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> * [Serve][LLM] Add rate-limiter logic for per request traceback spam (#62440) Signed-off-by: Vaishnavi Panchavati <[email protected]> Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Co-authored-by: Kourosh Hakhamaneshi <[email protected]> * [ci] Migrate dask-on-ray and shuffle compute configs to new schema (#62605) ## Summary - Migrate 9 compute config files (dask-on-ray and shuffle) from legacy Anyscale schema to new SDK schema - Add `anyscale_sdk_2026: true` to 5 test entries in `release_tests.yaml` ## Config files migrated - `release/nightly_tests/dask_on_ray/dask_on_ray_sort_compute_template.yaml` (AWS, head-only) - `release/nightly_tests/dask_on_ray/dask_on_ray_sort_compute_template_gce.yaml` (GCE, head-only) - `release/nightly_tests/dask_on_ray/1tb_sort_compute.yaml` (AWS, head + 32 workers) - `release/nightly_tests/shuffle/shuffle_compute_multi.yaml` (AWS, head + 3 workers) - `release/nightly_tests/shuffle/shuffle_compute_multi_gce.yaml` (GCE, head + 3 workers) - `release/nightly_tests/shuffle/shuffle_compute_single.yaml` (AWS, head-only) - `release/nightly_tests/shuffle/shuffle_compute_single_gce.yaml` (GCE, head-only) - `release/nightly_tests/shuffle/shuffle_compute_autoscaling.yaml` (AWS, head + 0-19 workers) - `release/nightly_tests/shuffle/shuffle_compute_autoscaling_gce.yaml` (GCE, head + 0-19 workers) ## Test entries updated (anyscale_sdk_2026: true) - 
`dask_on_ray_100gb_sort` - `dask_on_ray_1tb_sort` - `shuffle_20gb_with_state_api` - `shuffle_100gb` - `autoscaling_shuffle_1tb_1000_partitions` ## Schema changes applied - `cloud_id` → `cloud` (env var name updated) - `head_node_type` → `head_node` (removed `name:` field) - `worker_node_types` → `worker_nodes` (omitted for head-only configs) - `min_workers`/`max_workers` → `min_nodes`/`max_nodes` - `use_spot: false` → `market_type: ON_DEMAND` - `advanced_configurations_json` / `gcp_advanced_configurations_json` → `advanced_instance_config` - GCE: `region` + `allowed_azs` → `zones` - Removed: `region`, `max_workers`, commented-out blocks - Capitalized `cpu` → `CPU` in resources ## Test plan - [x] All 9 configs validated against `ComputeConfig.from_yaml()` - [x] Verify CI passes with new configs 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: sai.miduthuri <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> * [ci] Migrate stress test and placement group compute configs to new schema (#62607) ## Summary Migrates 15 Anyscale compute config files from the legacy schema to the new SDK 2026 schema, and adds `anyscale_sdk_2026: true` to all corresponding test entries in `release_tests.yaml`. ### Compute configs migrated (15 files) **Stress tests** (`release/nightly_tests/stress_tests/`): - `stress_tests_compute.yaml` / `stress_tests_compute_gce.yaml` - `stress_tests_compute_large.yaml` / `stress_tests_compute_large_gce.yaml` - `smoke_test_compute.yaml` / `smoke_test_compute_gce.yaml` - `stress_test_threaded_actor_compute.yaml` - `placement_group_tests_compute.yaml` / `placement_group_tests_compute_gce.yaml` - `stress_tests_single_node_oom_compute.yaml` / `stress_tests_single_node_oom_compute_gce.yaml` **Placement group tests** (`release/nightly_tests/placement_group_tests/`): - `compute.yaml` / `compute_gce.yaml` - `pg_perf_test_compute.yaml` / `pg_perf_test_compute_gce.yaml` ### Tests updated in release_tests.yaml (9 tests) 1. `stress_test_placement_group` 2. `stress_test_state_api_scale` 3. `stress_test_many_tasks` 4. `stress_test_dead_actors` 5. `threaded_actors_stress_test` 6. `stress_test_many_runtime_envs` 7. `single_node_oom` 8. `pg_autoscaling_regression_test` 9. 
`placement_group_performance_test` ### Schema changes applied - `cloud_id` → `cloud`, `ANYSCALE_CLOUD_ID` → `ANYSCALE_CLOUD_NAME` - `head_node_type` → `head_node`, `worker_node_types` → `worker_nodes` - `min_workers` → `min_nodes`, `max_workers` → `max_nodes` - `use_spot: false` → `market_type: ON_DEMAND` - `advanced_configurations_json` / `gcp_advanced_configurations_json` → `advanced_instance_config` - GCE: `region` + `allowed_azs` → `zones` - Resources: `cpu` → `CPU`, `gpu` → `GPU`, flattened `custom_resources` - Removed: `region`, `max_workers`, head/worker `name` fields (kept where multiple workers share instance type) - Removed commented-out blocks - Added `CPU` resources to head nodes where `wait_for_nodes` > worker count ## Test plan - [x] All 15 config files validated against `ComputeConfig.from_yaml()` - [x] CI passes with `anyscale_sdk_2026: true` flag on all test entries: https://buildkite.com/ray-project/release/builds/89908 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: sai.miduthuri <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> * [ci] Migrate chaos test compute configs to new schema (#62606) ## Summary - Migrated 2 chaos test compute config files (`compute_template.yaml` and `compute_template_gce.yaml`) from legacy Anyscale compute config schema to the new SDK schema - Added `anyscale_sdk_2026: true` flag to all 16 chaos test entries in `release_tests.yaml` ### Config changes - `cloud_id` -> `cloud`, `ANYSCALE_CLOUD_ID` -> `ANYSCALE_CLOUD_NAME` - `head_node_type` -> `head_node`, `worker_node_types` -> `worker_nodes` - `min_workers`/`max_workers` -> `min_nodes`/`max_nodes` - `use_spot: false` -> `market_type: ON_DEMAND` - `advanced_configurations_json` -> `advanced_instance_config` - Flattened `resources` (removed `custom_resources` nesting, capitalized `cpu` -> `CPU`) - GCE: replaced `region` + `allowed_azs` with `zones` - Removed `region`, `max_workers`, and node `name` fields ### Tests updated (16) - `chaos_many_tasks_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}` - `chaos_many_actors_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}` - `chaos_streaming_generator_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}` - `chaos_object_ref_borrowing_{baseline,kill_raylet,iptable_failure_injection,terminate_instance}` ## Test plan - [x] Both config files validated against `ComputeConfig.from_yaml()` - [x] Verify chaos tests pass on nightly run after merge Signed-off-by: sai.miduthuri <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> * [ci] Migrate microbenchmark, benchmark-worker-startup, and rllib compute configs to new schema (#62604) ## Summary - Migrate 10 compute config files to the new Anyscale SDK schema (`cloud_id` -> `cloud`, `head_node_type` -> `head_node`, `worker_node_types` -> `worker_nodes`, etc.) 
- Add `anyscale_sdk_2026: true` flag to 12 test cluster blocks in `release_tests.yaml` ## Config files migrated - `release/microbenchmark/tpl_64.yaml` (AWS, head-only) - `release/microbenchmark/tpl_64_gce.yaml` (GCE, head-only) - `release/microbenchmark/experimental/compute_t4_gpu.yaml` (AWS, head-only GPU) - `release/microbenchmark/experimental/compute_gpu_2x1_aws.yaml` (AWS, head+worker GPU) - `release/microbenchmark/experimental/compute_a100_gpu.yaml` (AWS, head-only GPU) - `release/microbenchmark/experimental/compute_l4_gpu.yaml` (AWS, head-only GPU) - `release/microbenchmark/experimental/compute_l4_gpu_2x1_aws.yaml` (AWS, head+worker GPU) - `release/benchmark-worker-startup/only_head_node_1gpu_64cpu.yaml` (AWS, head-only GPU) - `release/benchmark-worker-startup/only_head_node_1gpu_64cpu_gce.yaml` (GCE, head-only) - `release/rllib_tests/1gpu_16cpus.yaml` (AWS, head-only GPU) ## Tests updated with `anyscale_sdk_2026: true` - `microbenchmark` (base + GCE variation) - `compiled_graphs` - `compiled_graphs_GPU` - `compiled_graphs_GPU_multinode` - `compiled_graphs_GPU_cu130` - `compiled_graphs_GPU_multinode_cu130` - `rdt_single_node_T4_microbenchmark` - `rdt_single_node_A100_microbenchmark` - `benchmark_worker_startup` (base + GCE variation) - `rllib_learning_tests_pong_appo_torch` ## Test plan - [x] All 10 config files validated against `ComputeConfig.from_yaml()` - [x] CI passes with the new configs 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: sai.miduthuri <[email protected]> Co-authored-by: Claude Opus 4.6 <[email protected]> * recompiling requirements_compiled_py313.txt Signed-off-by: elliot-barn <[email protected]> * [deps] updating tag on py313 deps (#63033) updating tag on py313 deps to prevent unnecessary compilation in premerge Signed-off-by: elliot-barn <[email protected]> * [core] fix the mypy type check on BaseContext.__exit__ (#62999) ## Description Fix the type error on the `BaseContext.__exit__`. Also added the reported use case to our mypy test case. ## Related issues Fixes https://github.com/ray-project/ray/issues/62971 Signed-off-by: Rueian Huang <[email protected]> * [core] increase the cleanup timeout in the chaos iptable test (#62992) ## Description Increase the waiting time for the cleanup according to the cluster logs from the [failure](https://buildkite.com/ray-project/release/builds/90709#019dd2b5-f78b-47eb-aa8f-331c5c68cad3): ### Timeline - **23:14:00**: Actor workload starts with network failure injection every 60s. - **23:15:00**: First 5s network fault affects head + 4 workers. - **23:16:02**: Raylet reports worker process `10563` did not register within timeout. - **23:16:02-23:17:00**: `ReportActor.add` retries pile up after connection resets; progress stalls near 47%. - **23:18:34-23:20:34**: Head state dumps show `128` total worker CPUs and `0` available while actor work is still running. - **23:21:34**: Head sees `128` total CPUs, `112` available. Missing `16` CPUs are all on `10.0.45.36`. - **23:21:43**: Worker `10.0.45.36` reports 16 `ReportActor.__init__` workers, each holding `1 CPU`. - **23:21:47**: Those 16 `ReportActor` workers disconnect gracefully. - **23:21:49**: `wait_for_condition` times out before observing all CPUs released; another network fault triggers at the same time. So, increasing the cleanup consistency timeout should likely fix this specific failure. 
## Related issues Fixes: https://github.com/anyscale/ray/issues/1534 ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. --------- Signed-off-by: Rueian Huang <[email protected]> Signed-off-by: Rueian <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [core] move observability pubsub to ObservabilityPubSubService (#62806) This PR is a follow-up to https://github.com/ray-project/ray/pull/62461, which isolates 3 pubsub channels that have lower priorities and are not for the critical control plan from the InternalPubSubGcsService to their own io_context and the new ObservabilityPubSubService: pubsub_pb2.RAY_ERROR_INFO_CHANNEL pubsub_pb2.RAY_LOG_CHANNEL pubsub_pb2.RAY_NODE_RESOURCE_USAGE_CHANNEL This will ensure that they won't block the critical control plan. The new ObservabilityPubSubRpcClient client also allows us to move the service out of GCS if needed in the future. --------- Signed-off-by: Rueian Huang <[email protected]> Signed-off-by: Rueian <[email protected]> * [ci] Fix doc build failing on broken pytorch intersphinx inventory (#63038) - The doc build (`make -C doc html`, which runs `sphinx-build -W --keep-going`) is failing with `build finished with problems, 1 warning`. The single Sphinx warning is an intersphinx fetch failure: `https://pytorch.org/docs/stable/objects.inv` 301s to `https://docs.pytorch.org/docs/stable/objects.inv`, which currently 404s upstream. With `-W`, that one warning fails CI. - Repoint the `torch` intersphinx mapping in `doc/source/conf.py` to bypass the broken `/stable/objects.inv`. The base URL stays at the canonical `https://docs.pytorch.org/docs/stable/` so generated cross-reference links still target /stable/, but the inventory is fetched from a working pinned version: `https://docs.pytorch.org/docs/2.7/objects.inv`. - Pin matches Ray's runtime torch version (`torch==2.7.0` in `python/requirements/ml/dl-{cpu,gpu}-requirements.txt`), so cross-refs only resolve to symbols that actually exist in the torch users get. ## Why pin to 2.7 and not /stable/ or /main/ - `/stable/objects.inv` is the upstream-broken URL we're routing around, so it can't be the source. - `/main/objects.inv` works but tracks the development branch, which can index APIs that don't exist in 2.7 — leading to cross-refs resolving to symbols Ray users can't actually call. - `/2.7/objects.inv` matches the runtime exactly. Tradeoff: when Ray bumps torch, this URL needs to bump alongside the requirements pin. Post merge run: https://buildkite.com/ray-project/postmerge/builds/17329 Signed-off-by: elliot-barn <[email protected]> * [observability] add instance filter to gpu usage metric query (#62214) ## Description Adds instance filter to the node gpu usage metric panel Signed-off-by: carolynwang <[email protected]> Co-authored-by: Mengjin Yan <[email protected]> * Remove redundant flaky integration test in favor of unit tests (#63004) ## Description PR [[Core] (Resource Isolation 12/n) Switch group killing policy to by time killing policy](https://github.com/ray-project/ray/pull/62643), enabled the new by-time killing policy by default opposed to the legacy by-group killing policy. This resulted in `test_memory_pressure` failures in post merge. 
We found the following in our investigation: * The integration test tests for policy specific behaviors when the memory pressure integration test suite should instead tests for the memory monitoring system's general ability to reduce memory pressure. * The failing integration test should be unit test that tests for the killing policy's behavior directly. In general, we prefer unit test over integration tests for memory threshold sensitive tests as the test environment can have significant impact on the test result, leading to flaky test behaviors. This PR removes redundant integration tests that tests for policy specific behaviors already covered by the policy's unit testing, and introduces a new unit test for cases that were previously covered by the integration test. The following are the removed integration test and their replacements: * `test_restartable_actor_oom_retry_off_throws_oom_error` -> redundant to `test_restartable_actor_throws_oom_error` * `test_memory_pressure_kill_newest_worker` -> replaced by `TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc * `test_memory_pressure_kill_task_if_actor_submitted_task_first` -> replaced by `TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc * `test_task_oom_no_oom_retry_fails_immediately` -> replaced by `TestTaskOomKillNoOomRetryFailsImmediately` in https://github.com/ray-project/ray/blob/master/src/ray/core_worker/tests/task_manager_test.cc * `test_task_oom_only_uses_oom_retry` -> replaced by `TestTaskOomInfiniteRetry` in https://github.com/ray-project/ray/blob/master/src/ray/core_worker/tests/task_manager_test.cc * `test_newer_task_not_retriable_kill_older_retriable_task_first` -> replaced by `TestPolicyPrioritizesRetriableOverNonRetriable` in https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc * `test_put_object_task_usage_slightly_below_limit_does_not_crash` -> replaced by `TestMonitorDetectsMemoryBelowThresholdCallbackNotExecuted` in https://github.com/ray-project/ray/blob/master/src/ray/common/tests/threshold_memory_monitor_test.cc * `test_last_task_of_the_group_fail_immediately` -> replaced by `TestLastWorkerInGroupShouldNotRetry` in https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_group_by_owner_test.cc * `test_one_actor_max_lifo_kill_next_actor` -> replaced by `TestPolicyPrioritizesNewerWorkersWithinSameRetriability` in https://github.com/ray-project/ray/blob/master/src/ray/raylet/tests/worker_killing_policy_by_time_test.cc ## Additional information `test_memory_pressure` run: https://buildkite.com/ray-project/postmerge/builds/17288 --------- Signed-off-by: davik <[email protected]> Co-authored-by: davik <[email protected]> * [Train] Add missing %s to logger.debug (#63039) `logger.debug` was missing the %s and as a result clogging up the logs --------- Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]> * Add perf metrics for 2.55.0 (#63060) ``` REGRESSION 52.54%: tasks_per_second (THROUGHPUT) regresses from 386.6133448073775 to 183.49078025658062 in benchmarks/many_nodes.json REGRESSION 37.10%: tasks_per_second (THROUGHPUT) regresses from 594.0367087794571 to 373.6653345877981 in benchmarks/many_tasks.json REGRESSION 4.22%: single_client_tasks_and_get_batch 
(THROUGHPUT) regresses from 5.723101265712336 to 5.481786077048712 in microbenchmark.json REGRESSION 4.09%: multi_client_put_gigabytes (THROUGHPUT) regresses from 42.60577675231464 to 40.8627833341568 in microbenchmark.json REGRESSION 1.86%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.982001139120161 to 0.9637211637507427 in microbenchmark.json REGRESSION 0.84%: client__get_calls (THROUGHPUT) regresses from 1119.7606509262687 to 1110.3815800718512 in microbenchmark.json REGRESSION 0.63%: 1_1_async_actor_calls_with_args_async (THROUGHPUT) regresses from 2985.2594797119345 to 2966.3149904468737 in microbenchmark.json REGRESSION 0.48%: client__put_calls (THROUGHPUT) regresses from 851.7996054229982 to 847.7132252307356 in microbenchmark.json REGRESSION 289.14%: dashboard_p95_latency_ms (LATENCY) regresses from 37.856 to 147.311 in benchmarks/many_pgs.json REGRESSION 135.33%: dashboard_p99_latency_ms (LATENCY) regresses from 798.453 to 1879.035 in benchmarks/many_pgs.json REGRESSION 110.53%: stage_4_spread (LATENCY) regresses from 0.3184540688712737 to 0.6704279092079272 in stress_tests/stress_test_many_tasks.json REGRESSION 48.31%: avg_pg_remove_time_ms (LATENCY) regresses from 1.154493106606675 to 1.7122211741741544 in stress_tests/stress_test_placement_group.json REGRESSION 34.75%: dashboard_p50_latency_ms (LATENCY) regresses from 5.002 to 6.74 in benchmarks/many_pgs.json REGRESSION 21.20%: stage_0_time (LATENCY) regresses from 7.112839698791504 to 8.620674133300781 in stress_tests/stress_test_many_tasks.json REGRESSION 19.38%: stage_3_creation_time (LATENCY) regresses from 2.621494770050049 to 3.1294972896575928 in stress_tests/stress_test_many_tasks.json REGRESSION 8.31%: dashboard_p95_latency_ms (LATENCY) regresses from 42.959 to 46.531 in benchmarks/many_nodes.json REGRESSION 8.03%: 107374182400_large_object_time (LATENCY) regresses from 22.459637914999973 to 24.263247010999976 in scalability/single_node.json REGRESSION 8.00%: 10000_args_time (LATENCY) regresses from 11.357349357000004 to 12.265755501000008 in scalability/single_node.json REGRESSION 7.69%: avg_pg_create_time_ms (LATENCY) regresses from 1.5098637252248464 to 1.6259311876874045 in stress_tests/stress_test_placement_group.json REGRESSION 3.67%: 3000_returns_time (LATENCY) regresses from 3.577688757000004 to 3.7088375179999957 in scalability/single_node.json ``` Signed-off-by: Lonnie Liu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]> * [core] rename InternalPubSub* to ControlPlanePubSub* (#63044) ## Description Renaming `InternalPubSub*` to `ControlPlanePubSub*` for clarity. Following up to https://github.com/ray-project/ray/pull/62806#pullrequestreview-4207199543 ## Related issues > Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234". ## Additional information > Optional: Add implementation details, API changes, usage examples, screenshots, etc. Signed-off-by: Rueian Huang <[email protected]> * [Serve][2/5] Add custom ingress request router app interfaces (#62680) Direct ingress needs an app-scoped ingress request router deployment that HAProxy can call to map each request to a target replica ID before forwarding the request to the selected replica. This change attaches that router to the Serve application object itself, so both imperative and declarative deployment paths consume the same composed application graph. ## API shape Imperative usage: ```python llm_server = LLMServer.bind(...) 
ingress_request_router = IngressRequestRouter.bind( llm_deployment=llm_server, ) app = llm_server._with_ingress_request_router(ingress_request_router) serve.run(app, route_prefix="/v1") ``` Declarative usage: ```python # my_module.py llm_server = LLMServer.bind(...) ingress_request_router = IngressRequestRouter.bind( llm_deployment=llm_server, ) app = llm_server._with_ingress_request_router(ingress_request_router) ``` ```yaml applications: - name: llm route_prefix: /v1 import_path: my_module:app ``` Signed-off-by: Seiji Eicher <[email protected]> Co-authored-by: Claude Haiku 4.5 <[email protected]> * [Core] Match expected resource isolation integration test constraint to new cgroup constraint (#63054) ## Description The resource isolation python integration tests are currently failing because the resource isolation upper bound constraint has been adjusted from `memory.max` to `memory.high` in the latest resource isolation changes without updating the integration test. This PR adjust the resource isolation integration test to match the latest changes in to use `memory.high` upper bound constraint. The resource isolation PR that updated the memory constraint without updating the test: https://github.com/ray-project/ray/pull/62705/changes#diff-60b34dab728b2e51426a465dd712767a8735682e137e52ebfe030123aeeb56d5L69-R77 ## Related issues Fixes failing core: cgroup tests --------- Signed-off-by: davik <[email protected]> Co-authored-by: davik <[email protected]> * [serve] Enable logs in `LongPollHost` when `LongPollClient` stops its attached event loop (#63028) --------- Signed-off-by: Jeffrey Wang <[email protected]> * [Train] Reduce `test_result_restore` flakiness (#63045) Reviewing the logs for a flaky run of `test_result_restore` then it shows that rank 1 has a training report but rank 0 doesn't (the RuntimeError in rank-1 runs before the checkpoint in rank-0 is saved) and therefore when computing the `get_best_checkpoints` there is missing checkpoints and occasionally the wrong results are returned. We can easily resolve this through adding a sync barrier between workers before raising the error to ensure that the checkpoints are all saved. --------- Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]> * [Data] Fix HashAggregate duplicate group rows for AggregateFnV2 (#63066) ## Summary `TableBlockBuilder.build()` reordered rows across an internal compaction boundary, so `_aggregate`'s per-block partial-aggregate output could be unsorted by the group key. That violates the "inputs are sorted by key" precondition that `_combine_aggregated_blocks`' `heapq.merge` relies on, and surfaced as duplicate group rows in HashAggregate output whose count varied with the parallelism arg. Two issues, both fixed: 1. **`TableBlockBuilder.build()`** put the still-uncompacted dict-of-lists (newest rows) in front of the previously-compacted tables. Now appends the uncompacted tail after the compacted tables — preserving insertion order. 2. **`ArrowBlockBuilder._combine_tables`** called `transform_pyarrow.concat` without `preserve_order=True`. When block schemas didn't unify exactly (common for V2 aggregators whose accumulator varies in shape between rows — e.g. an empty list vs. a non-empty list, inferring `list<null>` vs `list<string>`), `concat` took a fast path that groups schema-matching blocks together and prepends mismatched ones. 
Now passes `preserve_order=True` since the builder's contract is to preserve insertion order regardless of internal compaction or schema unification. ## Where `_combine_tables` sits in the hash-shuffle lifecycle ```mermaid sequenceDiagram autonumber participant ShuffleTask as _shuffle_block (Ray task) participant Closure as input_block_transformer<br/>(_aggregate closure) participant TableAcc as TableBlockAccessor._aggregate participant Builder as TableBlockBuilder participant Combine as ArrowBlockBuilder._combine_tables participant Aggregator as HashShuffleAggregator participant Reducer as ReducingAggregation ShuffleTask->>Closure: block_transformer(block) Closure->>Closure: pruned.sort(sort_key) Closure->>TableAcc: target._aggregate(sort_key, aggs) loop for each group (sorted) TableAcc->>Builder: builder.add(row) Note over Builder: _compact_if_needed may flush<br/>_columns into _tables mid-loop end TableAcc->>Builder: builder.build() Builder->>Combine: _combine_tables(_tables + [_columns_partial]) Note over Combine: ★ FIXES LIVE HERE<br/>build(): append uncompacted tail (was: prepend)<br/>_combine_tables: preserve_order=True Combine-->>Builder: sorted partial-aggregate block Builder-->>TableAcc: sorted partial-aggregate block TableAcc-->>Closure: sorted partial-aggregate block Closure-->>ShuffleTask: sorted partial-aggregate block ShuffleTask->>ShuffleTask: hash_partition (np.where + take, preserves order) ShuffleTask->>Aggregator: aggregator.submit.remote(shard) Aggregator->>Reducer: compact / finalize (List[Block]) Reducer->>Reducer: _combine_aggregated_blocks<br/>(heapq.merge — now sees sorted inputs) ``` The bug was in step 8: `_combine_tables` and `build()` could permute rows across compactions, propagating unsorted blocks through steps 9–14 to the `heapq.merge` in step 15, which silently produced duplicate group rows because its consecutive-equal-key grouping only collapses adjacent rows. ## Test plan - [x] New regression test `test_partial_aggregate_preserves_sort_after_builder_compaction` in `python/ray/data/tests/test_hash_shuffle.py` forces compaction on every row via `MAX_UNCOMPACTED_SIZE_BYTES=1` and asserts partial-aggregate output is sorted by the group key. Fails on master, passes after this change. - [x] Full `test_hash_shuffle.py` suite (19 tests) passes. - [x] `test_hash_shuffle_aggregator.py` suite passes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Signed-off-by: Goutam <[email protected]> Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]> * [Data][LLM] Fix wrong documented default for max_tasks_in_flight_per_actor (#62917) Signed-off-by: Aydin Abiar <[email protected]> Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]> * [train] Export default data execution options (#62784) Follow-up to #59186, which only captured `execution_options` when the user provided them per-dataset in the form of a dict, dropping the default or user-provided global `ExecutionOptions`. This PR captures the default and user-provided global options alongside the per-dataset execution options, exposed via a typed `DataExecutionOptions` model split into `default` and `per_dataset_execution_options`. --------- Signed-off-by: JasonLi1909 <[email protected]> Signed-off-by: Jason Li <[email protected]> * [Data] Convert abstract logical operator classes to frozen dataclasses (#62593) ## Description #### Why this is needed: This is the next PR in the `#60312` logical plan migration stack. 
After finishing the remaining concrete operator coverage, the next step in the split plan is to convert `LogicalOperator` and the abstract logical operator base classes into frozen dataclasses. #### What this PR changes: Makes the following abstract logical operator classes frozen dataclasses: - `LogicalOperator` - `NAry` - `AbstractOneToOne` - `AbstractMap` - `AbstractUDFMap` - `AbstractAllToAll` This PR also makes `_name`, `_input_dependencies`, and `_num_outputs` proper dataclass fields on `LogicalOperator`, removing the manual `LogicalOperator.__init__`. Extending the same step to additional abstract-base state runs into concrete dataclass constructor-generation errors (for example, `TypeError: non-default argument 'input_op' follows default argument`), so the broader field-model cleanup remains in the later follow-up PRs. This PR does not include the later `_name` derived-field work, `_apply_transform` deduplication, `input_op: InitVar` replacement, or broader logical-rule cleanup follow-ups. ## Related issues Closes #60312 ## Additional information This PR corresponds to the current split-plan step for making `LogicalOperator` and the abstract logical operator base classes frozen dataclasses. ### Tests - `python -m pre_commit run --files python/ray/data/_internal/logical/interfaces/logical_operator.py python/ray/data/_internal/logical/operators/n_ary_operator.py python/ray/data/_internal/logical/operators/one_to_one_operator.py python/ray/data/_internal/logical/operators/map_operator.py python/ray/data/_internal/logical/operators/all_to_all_operator.py` - `PYTHONPATH=python python -m pytest -q python/ray/data/tests/test_execution_optimizer_basic.py -k 'map or repartition or sort or union or zip'` - `PYTHONPATH=python python -m pytest -q python/ray/data/tests/test_union.py python/ray/data/tests/test_split.py -k 'union or split'` - `PYTHONPATH=python python -m pytest -q python/ray/data/tests/test_join.py -k 'inner or outer or semi or anti'` ### Stack Plan Done: - PR-A: Add a default property implementation for `LogicalOperator.name` - PR-B: Move logical `output_dependencies` handling out of logical operators - PR-C: Make `LogicalOperator` an ABC with abstract `num_outputs` - PR-D1: Convert one-to-one logical operators to frozen dataclasses - PR-D2: Convert map logical operators to frozen dataclasses - PR-D3: Convert all-to-all, join, read, and write logical operators to frozen dataclasses - PR-D4: Convert remaining source logical operators to frozen dataclasses - PR-Next-0: Convert remaining concrete logical operators to frozen dataclasses - This PR: make `LogicalOperator` and the abstract logical operator base classes frozen dataclasses Next: - make `_name` a derived field - deduplicate `_apply_transform` - replace `input_op: InitVar` with a real `input_dependencies` field - remove `input_dependency` on `AbstractOneToOne` - clean up `_get_args` - remove redundant `__repr__` / `__str__` - clean up special-casing in logical rules - finalize equality / comparability work for `#60312` --------- Signed-off-by: yaommen <[email protected]> * [ci] convert core.rayci.yml test steps to array and narrow subsets (#62799) Convert the two remaining matrix test steps in core.rayci.yml — "core: python {{matrix.python}} tests" (matrix setup with python + worker_id) and "core: minimal tests" — to array syntax; their corebuild-multipy and minbuild-core depends_on refine from (*) to ($). 
Narrow three (*) fan-ins in core.rayci.yml down to (python=3.10) subsets for the wheel tests, HA integration, and runtime env container steps that only exercise python 3.10. Across cicd, data, dependencies, doc, kuberay, llm, ml, others, rllib, and serve, narrow each oss-ci-base_* (*) dependency to (python=X.Y) where the consuming step pins a single python version; leave (*) in place where the step truly spans multiple versions (data top-level ml base, ml mlbuild-multipy / mlgpubuild-multipy, serve top-level build base). Signed-off-by: andrew <[email protected]> Co-authored-by: Lonnie Liu <[email protected]> * [Data] Make logical operator names derived by default (#63084) ## Description #### Why this is needed: This is the next PR in the `#60312` logical plan migration stack. After moving the shared logical-operator backing fields to the abstract-class layer, `_name` is still wired manually in many concrete operators. Most of those assignments are just the operator class name, so the next step is to make that default behavior come from the base logical-operator layer. #### What this PR changes: Makes logical-operator names derived by default from the base logical-operator layer. For operators without a special naming rule, `name` now defaults to `self.__class__.__name__`. This PR removes concrete `_name` wiring where the assigned value was only the class name, while preserving the special naming cases that still need explicit values, such as `Read`, `Limit`, `RandomShuffle`, `RandomizeBlocks`, and UDF-based map operators. This PR does not include the later `_apply_transform` deduplication, `input_op: InitVar` replacement, `_get_args` cleanup, or broader logical-rule cleanup follow-ups. ## Related issues Part of #60312 ## Additional information This PR corresponds to the `_name` derived-field step in the current split plan. 
### Tests - `python -m pre_commit run --files python/ray/data/_internal/logical/interfaces/logical_operator.py python/ray/data/_internal/logical/operators/one_to_one_operator.py python/ray/data/_internal/logical/operators/n_ary_operator.py python/ray/data/_internal/logical/operators/all_to_all_operator.py python/ray/data/_internal/logical/operators/count_operator.py python/ray/data/_internal/logical/operators/input_data_operator.py python/ray/data/_internal/logical/operators/from_operators.py python/ray/data/_internal/logical/operators/streaming_split_operator.py python/ray/data/_internal/logical/operators/join_operator.py python/ray/data/_internal/logical/operators/write_operator.py python/ray/data/_internal/logical/operators/map_operator.py` - `PYTHONPATH=python python -m pytest -q python/ray/data/tests/test_state_export.py python/ray/data/tests/unit/test_logical_plan.py python/ray/data/tests/test_execution_optimizer_basic.py -k 'Project or Count or InputData or Union or Zip or split or join or write or read or map'` - `PYTHONPATH=python python -m pytest -q python/ray/data/tests/test_execution_optimizer_advanced.py python/ray/data/tests/test_union.py python/ray/data/tests/test_split.py -k 'zip or union or split or project or read or write or join'` ### Stack Plan Done: - PR-A: Add a default property implementation for `LogicalOperator.name` - PR-B: Move logical `output_dependencies` handling out of logical operators - PR-C: Make `LogicalOperator` an ABC with abstract `num_outputs` - PR-D1: Convert one-to-one logical operators to frozen dataclasses - PR-D2: Convert map logical operators to frozen dataclasses - PR-D3: Convert all-to-all, join, read, and write logical operators to frozen dataclasses - PR-D4: Convert remaining source logical operators to frozen dataclasses - PR-Next-0: Convert the remaining concrete logical operators to frozen dataclasses - PR-Next-1: Convert abstract logical operator classes to frozen dataclasses - This PR: make logical-operator names derived by default Next: - deduplicate `_apply_transform` - replace `input_op: InitVar` with a real `input_dependencies` field - remove `input_dependency` on `AbstractOneToOne` - clean up `_get_args` - remove redundant `__repr__` / `__str__` - clean up special-casing in logical rules - finalize equality / comparability work for `#60312` Signed-off-by: yaommen <[email protected]> * [Data] Deduplicate logical operator apply transform (#63089) ## Description #### Why this is needed: This is the next PR in the `#60312` logical plan migration stack. After making logical operators frozen dataclasses and moving logical operator names to the base layer, most concrete operators still carry near-identical `_apply_transform` implementations. Each implementation recursively transforms its input operator, keeps `self` when the input is unchanged, and rebuilds the operator when the input changes. #### What this PR changes: Adds a frozen-safe default `_apply_transform` implementation to `LogicalOperator` and moves operator-specific rebuild details into small `_with_new_input` / `_with_new_input_dependencies` hooks. For single-input operators, concrete dataclass operators with `input_op` still use `dataclasses.replace(self, input_op=...)`, while generic custom subclasses keep the previous shallow-copy child rewiring behavior. `RandomShuffle` and `Repartition` keep small hooks because they still have InitVar-only constructor values that must be passed during replacement. 
`NAry` owns the common n-ary rebuild path for `Zip` and `Union`, and `Join` keeps the multi-input rebuild hook for its left and right inputs. This PR does not replace `input_op: InitVar` with a real `input_dependencies` field, remove `input_dependency` from `AbstractOneToOne`, clean up `_get_args`, or clean up logical-rule special-casing. Those remain separate follow-ups in the current split plan. ## Related issues Part of #60312 ## Additional information This PR corresponds to the `_apply_transform` deduplication step in the current split plan. It reopens the same change from #63086 against `master`. ### Tests - `python -m pre_commit run --files python/ray/data/_internal/logical/interfaces/logical_operator.py python/ray/data/_internal/logical/operators/all_to_all_operator.py python/ray/data/_internal/logical/operators/count_operator.py python/ray/data/_internal/logical/operators/join_operator.py python/ray/data/_internal/logical/operators/map_operator.py python/ray/data/_internal/logical/operators/n_ary_operator.py python/ray/data/_internal/logical/operators/one_to_one_operator.py python/ray/data/_internal/logical/operators/streaming_split_operator.py python/ray/data/_internal/logical/operators/write_operator.py python/ray/data/tests/unit/test_logical_plan.py` - `PYTHONPATH=python python -m pytest -q python/ray/data/tests/unit/test_logical_plan.py python/ray/data/tests/test_execution_optimizer_limit_pushdown.py::test_limit_pushdown_recreates_frozen_download` - In #63086: `PYTHONPATH=python python -m pytest -q python/ray/data/tests/test_execution_optimizer_limit_pushdown.py python/ray/data/tests/test_execution_optimizer_advanced.py python/ray/data/tests/test_union.py python/ray/data/tests/test_split.py -k 'limit_pushdown_recreates_frozen_download or zip_e2e or union or split or project or join'` ### Stack Plan Done: - PR-A: Add a default property implementation for `LogicalOperator.name` - PR-B: Move logical `output_dependencies` handling out of logical operators - PR-C: Make `LogicalOperator` an ABC with abstract `num_outputs` - PR-D1: Convert one-to-one logical operators to frozen dataclasses - PR-D2: Convert map logical operators to frozen dataclasses - PR-D3: Convert all-to-all, join, read, and write logical operators to frozen dataclasses - PR-D4: Convert remaining source logical operators to frozen dataclasses - PR-Next-0: Convert the remaining concrete logical operators to frozen dataclasses - PR-Next-1: Convert abstract logical operator classes to frozen dataclasses - PR-Next-2: make logical-operator names derived by default - This PR: deduplicate `_apply_transform` Next: - replace `input_op: InitVar` with a real `input_dependencies` field - remove `input_dependency` on `AbstractOneToOne` - clean up `_get_args` - remove redundant `__repr__` / `__str__` - clean up special-casing in logical rules - finalize equality / comparability work for `#60312` --------- Signed-off-by: yaommen <[email protected]> * [RLlib] Fix ValueError in MultiAgentEpisode.get_rewards() when agent inactive for all requested env steps (#62907) ## Summary Fixes #62903 - `MultiAgentEpisode.get_rewards()` (and other `get_*` methods) no longer crashes with `ValueError` when called on a finalized multi-agent episode where an agent was inactive during the requested env steps. When retrieving per-agent data by env step indices, `_get_single_agent_data_by_env_step_indices` filters out `SKIP_ENV_TS_TAG` entries for agents that didn't participate in certain env steps. 
If an agent was inactive for **all** requested env steps, the filtered indices list became empty, causing `InfiniteLookbackBuffer.get(indices=[])` → `batch([])` → `ValueError: Input list_of_structs does not contain any items`. This PR adds an early return of an empty list when all indices are filtered out, allowing the caller's existing `if len(agent_values) > 0` guard to correctly exclude the inactive agent from the result dict.

<details>
<summary>Before</summary>

```
episode.get_rewards(indices=slice(1, 3))

ValueError: Input `list_of_structs` does not contain any items.
  File "ray/rllib/env/multi_agent_episode.py", line 2554, in _get_data_by_env_steps
    agent_values = self._get_single_agent_data_by_env_step_indices(
  File "ray/rllib/env/multi_agent_episode.py", line 2753, in _get_single_agent_data_by_env_step_indices
    ret = inf_lookback_buffer.get(
  File "ray/rllib/env/utils/infinite_lookback_buffer.py", line 243, in get
    data = batch(data)
  File "ray/rllib/utils/spaces/space_utils.py", line 315, in batch
    raise ValueError("Input `list_of_structs` does not contain any items.")
```

</details>

<details>
<summary>After</summary>

```python
episode.get_rewards(indices=slice(1, 3))
# {'a0': array([0.2, 0.3])}
# a1 correctly excluded — it was inactive during env steps 1 and 2
```

</details>

<details>
<summary>Test results</summary>

```
$ python -m pytest test_multi_agent_episode.py -v -x
test_multi_agent_episode.py::TestMultiAgentEpisode::test_add_env_reset PASSED [ 5%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_add_env_step PASSED [ 11%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_cut PASSED [ 17%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_actions PASSED [ 23%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_infos PASSED [ 29%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_observations PASSED [ 35%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_return PASSED [ 41%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_rewards PASSED [ 47%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_sample_batch PASSED [ 52%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_get_state_and_from_state PASSED [ 58%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_init PASSED [ 64%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_len PASSED [ 70%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_other_getters PASSED [ 76%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_setters PASSED [ 82%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_slice PASSED [ 88%]
test_multi_agent_episode.py::TestMultiAgentEpisode::test_slice_with_lookback PASSED [ 94%]
test_multi_agent_episode.py::test_multi_agent_episode_functionality PASSED [100%]
======================== 17 passed, 3 warnings in 5.03s ========================
```

</details>

## Test plan

- [x] Reproduced the bug: `get_rewards(indices=slice(1,3))` raises `ValueError` on a finalized episode with an inactive agent
- [x] Verified the fix: same call now returns `{'a0': array([0.2, 0.3])}` with the inactive agent correctly excluded
- [x] Verified `get_rewards()` without indices still works as before
- [x] Added regression test to `test_get_rewards` in `test_multi_agent_episode.py` covering:
  - Finalized (numpy) episode with `get_rewards(indices=slice(1,3))` — the exact crash scenario
  - `get_actions(indices=slice(1,3))` — proves the fix covers all `get_*` methods (shared code path)
  - Non-finalized episode with same
scenario — proves finalized/non-finalized behavior is consistent - [x] All 17 tests in `test_multi_agent_episode.py` pass (regression test is inside existing `test_get_rewards`) --------- Signed-off-by: Cursx <[email protected]> * Add shared Claude Code configuration for Ray development (#62554) ## Description: Sets up hierarchical Claude Code instructions for the Ray repo so that each team (Data, Serve, Train, Tune, RLlib, C++ Core) can maintain their own scoped rules and skills. ## Primary changes: - Root `.claude/CLAUDE.md` with shared instructions, per-library templates teams can fill in - Path-scoped `.claude/rules/ `for Python guidelines, C++ style, security, debugging - Shared skills: `/rebuild`, `/lint`, `/fetch-buildkite-logs` for common workflows - `.claude/settings.json` with common permissions - Developer docs at `doc/source/ray-contribute/agent-development.rst` covering personal setup, worktree support, and how to add team-specific rules/skills - `.gitignore` updated to version-control shared config while keeping personal files local reference: https://code.claude.com/docs/en/best-practices ## Future work: Support other coding agents like codex, the instructions can be written in common markdown files and imported inside coding agent specific instruction files. we can also integrate with anyscale managed skills to help debug release tests running on anyscale workspaces. --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Edward Oakes <[email protected]> * refactor(raylet): split task option helpers into task_options_utils.pxi Signed-off-by: hieuvous <[email protected]> * refactor(raylet): move resource task option helpers Signed-off-by: chichic21039 <[email protected]> * Refactor/raylet task options utils resources fallback (#7) * Move function and actor helpers to task options utils * Update task options utils and raylet * Resolve add/add merge conflict in task_options_utils.pxi * Resolve add/add merge conflict in task_options_utils.pxi * refactor(raylet): extract resources and fallback helpers to task_options_utils.pxi Signed-off-by: Duyhung080205 <[email protected]> --------- Signed-off-by: Duyhung080205 <[email protected]> Signed-off-by: Duy Hưng <[email protected]> Co-authored-by: hieuvous <[email protected]> Co-authored-by: hieuvous <[email protected]> Co-authored-by: HLDKNotFound <[email protected]> Co-authored-by: Duyhung080205 <[email protected]> --------- Signed-off-by: sai.miduthuri <[email protected]> Signed-off-by: Seiji Eicher <[email protected]> Signed-off-by: Chaitanya Bharadwaj <[email protected]> Signed-off-by: Chaitanya Bharadwaj <[email protected]> Signed-off-by: harshit <[email protected]> Signed-off-by: zhilong <[email protected]> Signed-off-by: slxswaa1993 <[email protected]> Signed-off-by: Richard Liaw <[email protected]> Signed-off-by: Jeffrey Wang <[email protected]> Signed-off-by: elliot-barn <[email protected]> Signed-off-by: Vaishnavi Panchavati <[email protected]> Signed-off-by: Kourosh Hakhamaneshi <[email protected]> Signed-off-by: Rueian Huang <[email protected]> Signed-off-by: Rueian <[email protected]> Signed-off-by: carolynwang <[email protected]> Signed-off-by: davik <[email protected]> Signed-off-by: Mark Towers <[email protected]> Signed-off-by: Lonnie Liu <[email protected]> Signed-off-by: Goutam <[email protected]> Signed-off-by: 
Aydin Abiar <[email protected]> Signed-off-by: JasonLi1909 <[email protected]> Signed-off-by: Jason Li <[email protected]> Signed-off-by: yaommen <[email protected]> Signed-off-by: andrew <[email protected]> Signed-off-by: Cursx <[email protected]> Signed-off-by: sampan <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Signed-off-by: hieuvous <[email protected]> Signed-off-by: chichic21039 <[email protected]> Signed-off-by: Duyhung080205 <[email protected]> Signed-off-by: Duy Hưng <[email protected]> Co-authored-by: Sai Miduthuri <[email protected]> Co-authored-by: Claude Opus 4.7 <[email protected]> Co-authored-by: Seiji Eicher <[email protected]> Co-authored-by: Goutam <[email protected]> Co-authored-by: Chaitanya Bharadwaj <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: harshit-anyscale <[email protected]> Co-authored-by: zhilong <[email protected]> Co-authored-by: slxswaa <[email protected]> Co-authored-by: Richard Liaw <[email protected]> Co-authored-by: Jeffrey Wang <[email protected]> Co-authored-by: Elliot Barnwell <[email protected]> Co-authored-by: Vaishnavi Panchavati <[email protected]> Co-authored-by: Kourosh Hakhamaneshi <[email protected]> Co-authored-by: Rueian <[email protected]> Co-authored-by: Carolyn Wang <[email protected]> Co-authored-by: Mengjin Yan <[email protected]> Co-authored-by: Kunchen (David) Dai <[email protected]> Co-authored-by: davik <[email protected]> Co-authored-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]> Co-authored-by: Kevin H. Luu <[email protected]> Co-authored-by: Lonnie Liu <[email protected]> Co-authored-by: Aydin Abiar <[email protected]> Co-authored-by: Jason Li <[email protected]> Co-authored-by: yaommen <[email protected]> Co-authored-by: Andrew Pollack-Gray <[email protected]> Co-authored-by: Lonnie Liu <[email protected]> Co-authored-by: Cursx <[email protected]> Co-authored-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]> Co-authored-by: hieuvous <[email protected]> Co-authored-by: chichic21039 <[email protected]> Co-authored-by: hieuvous <[email protected]> Co-authored-by: HLDKNotFound <[email protected]> Co-authored-by: Duyhung080205 <[email protected]>
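As an aside on the hash-shuffle ordering fix near the top of the commit list above, here is a minimal, self-contained sketch of the failure mode it describes. Plain Python lists of `(group_key, partial_count)` rows stand in for partial-aggregate blocks, and `reduce_blocks`, `sorted_a`, and `scrambled_b` are hypothetical names for illustration, not Ray code: `heapq.merge` assumes its inputs are already sorted, and grouping that only collapses consecutive equal keys silently emits duplicate group rows when they are not.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each "block" is a list of (group_key, partial_count) rows, a stand-in for the
# per-task partial-aggregate blocks (illustrative data only).
sorted_a = [("a", 1), ("b", 2)]
sorted_b = [("a", 3), ("c", 4)]
scrambled_b = [("c", 4), ("a", 3)]  # what an order-permuting build() could emit

def reduce_blocks(*blocks):
    # Merge the runs, then collapse *consecutive* rows with equal keys,
    # mirroring the adjacent-equal-key grouping the PR description attributes to step 15.
    merged = heapq.merge(*blocks, key=itemgetter(0))
    return [(key, sum(v for _, v in rows)) for key, rows in groupby(merged, key=itemgetter(0))]

print(reduce_blocks(sorted_a, sorted_b))     # [('a', 4), ('b', 2), ('c', 4)]  <- correct
print(reduce_blocks(sorted_a, scrambled_b))  # [('a', 1), ('b', 2), ('c', 4), ('a', 3)]  <- duplicate 'a' rows
```

This is why preserving insertion order through `build()` / `_combine_tables` matters: each task's partial-aggregate block must stay sorted by the group key for the downstream merge to collapse groups correctly.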
…d_py3.13.txt and depsets (ray-project#62864)

Refreshes `requirements_compiled_py3.13.txt` and the full set of raydepsets locks against current source pins, and adds the supporting CI plumbing and source-file changes needed to make the py3.13 lock resolvable as a constraint across all py3.10/3.11/3.12/3.13 depsets.

## CI infrastructure

- **`.buildkite/dependencies.rayci.yml`** — new `pip_compile_313_dependencies` Buildkite step (mirror of the existing 3.11 compile job). Runs `compile_313_pip_dependencies`, uploads the artifact, and fails the build if `requirements_compiled_py3.13.txt` drifts from source.
- **`ci/ci.sh`** — new `compile_313_pip_dependencies()` function that points pip-compile at the `python/requirements/py313/` and `python/requirements/ml/py313/` overrides and emits `requirements_compiled_py3.13.txt`.

## Source-file pins

These drive the lock changes — no manual edits to the generated lock files.

### `python/requirements/py313/test-requirements.txt`

- `fastapi==0.121.0` — FastAPI 0.125+ removed `pydantic.v1` route support; `test_pydantic_serialization` still uses v1 BaseModel.
- `asgiref==3.9.2` — 3.10+ regresses Serve direct-ingress timeout / disconnect handling.
- `redis==4.5.4` — TLS test compatibility.
- `opentelemetry-proto==1.39.0` and `opentelemetry-exporter-otlp-proto-grpc==1.39.0` — co-pinned with `opentelemetry-sdk` so vllm (rayllm depset) can satisfy the in-family pins.
- `grpcio==1.76.0` + matching `grpcio-tools` / `grpcio-status` — bisecting `test_raylet_and_agent_share_fate` against grpcio 1.80 startup cost on the runtime-env agent.
- `jsonschema>=4.23.0,<4.25.0` — 4.25 introduced `rfc3987-syntax` which pins `lark==1.3.1`, conflicting with vllm's `lark==1.2.2`.
- Dual `python_version`-marker pins for `protobuf`, `scipy`, `contourpy`, `networkx` — these packages dropped py3.10 wheels at the same time the py3.13 lock needed newer floors. Dual pinning preserves the cross-py-version compat path when the py3.13 lock is consumed as a constraint by py3.10 depsets. (A small marker-evaluation sketch follows this description.)

### `python/requirements/ml/py313/`

- `data-requirements.txt` — `lance-namespace==0.6.1`.
- `dl-cpu-requirements.txt` / `dl-gpu-requirements.txt` — `nvidia-nccl-cu12` aligned across CPU/GPU so the CPU-built lock doesn't pin a version that conflicts with cu128 torch in GPU depsets.
- `ml-requirements.txt` — dual `keras` pin (3.12.1 for py<3.11, 3.14.0 for py>=3.11); keras 3.13 dropped py3.10.
- `rllib-requirements.txt` — dual `onnxruntime` pin (1.20.0 / 1.24.4) keyed on python version.
- `train-requirements.txt` — `datasets==3.6.0`.

### `python/requirements/data/`

- `pyarrow-latest.txt` — added `delta-sharing`.
- `pyarrow-v9.txt` — pinned `datasets==2.14.4`, added `delta-sharing`.

## Depsets config

**`ci/raydepsets/configs/ci_data.depsets.yaml`** — added relax entries so v9 / tfxbsl resolves can downgrade chains together:

- `relaxed_data`: relaxed `delta-sharing`, `dill`, `multiprocess` (datasets 2.14.4 caps `dill<0.3.8` but py313 lock has `dill==0.4.1`).
- `relaxed_data_tfxbsl`: relaxed `absl-py`, `grpcio-status`, `contourpy`, `scipy`, `delta-sharing` (tfx-bsl 1.16.x caps `absl-py<2.0.0` and `protobuf<6`; contourpy 1.3.3 + apache-beam 2.53.0 numpy clash).

## Lock files

Regenerated `requirements_compiled_py3.13.txt` and ~70 depset locks under `python/deplocks/` (base / ci / llm / ray_img / docs).

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
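As a side note on the dual `python_version`-marker pins described above, here is a minimal sketch of how one source file can carry both pins at once. It uses the `packaging` library to evaluate the environment markers; `requirement_lines` and `active_pins` are hypothetical names, and the `keras` versions are simply the ones quoted in the pin list, not independently verified.

```python
# Minimal sketch: a single source file with dual python_version-marker pins
# resolves to different pins under py3.10 vs py3.13 (assumes the `packaging` library).
from packaging.markers import Marker

requirement_lines = [
    'keras==3.12.1; python_version < "3.11"',   # keeps the py3.10 path working
    'keras==3.14.0; python_version >= "3.11"',  # newer floor needed by the py3.13 lock
]

for interpreter in ("3.10", "3.13"):
    active_pins = [
        line.split(";")[0].strip()
        for line in requirement_lines
        if Marker(line.split(";", 1)[1].strip()).evaluate({"python_version": interpreter})
    ]
    print(interpreter, "->", active_pins)
# 3.10 -> ['keras==3.12.1']
# 3.13 -> ['keras==3.14.0']
```

Because only one pin is active for any given interpreter, the py3.13 lock can be fed back as a constraint to the py3.10 depsets without the resolver ever seeing two conflicting `keras` requirements.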

